MrParse: finding homologues in the PDB and the EBI AlphaFold database for molecular replacement and more

MrParse is a software package designed to aid decision making in molecular replacement (MR). It performs a sequence search to find search models, not only in the PDB, as would conventionally be performed, but also in the EBI AlphaFold database, and provides a series of analyses relevant to MR such as crystal pathology detection and sequence analysis.


Introduction
The dominant approach to solving the phase problem in crystallography is molecular replacement (MR). At the time of writing, 86% of crystal structures deposited in the Protein Data Bank (PDB; Burley et al., 2021) in 2021 were solved by this method. In MR, initial phase estimates are derived from the placement of a search model in the asymmetric unit, typically by successive rotation and translation steps (Scapin, 2013). Successful placement requires that the search model bear a sufficiently close structural resemblance to (part of) the target structure. Conventional MR typically deploys experimental PDB structures that are inferred to be homologous to the target structure (or one of its chains or domains). The inference of homology, from a significant result in a sequencebased database search with the target as a query, allows a reasonable supposition of structural similarity of the target and the PDB deposition, although this assumption can break down where a protein family can adopt distinct conformations. Furthermore, with distant homologues the degree of structural similarity between the target and the search model may be too low for successful placement, even with advanced maximumlikelihood-based methods (McCoy, 2004;McCoy et al., 2007;Read, 2001) and the use of methods to maximize their value Sammito et al., 2014).
Unconventional MR generally uses bioinformatics predictions to suggest or construct search models. Thus, a detailed consideration of the sequence properties of the target can help direct the structure-solution strategy (Pereira & Alva, 2021).
For example, a secondary-structure prediction can point to simple regular structural elements such as -helices (Rodríguez et al., 2012) or recurring tertiary packing features composed of several such elements (Sammito et al., 2013) as potential search models. Novel and divergent folds can also be explicitly predicted using ab initio modelling (also known as de novo, free or template-independent modelling). The first broadly successful algorithms in the field (Leaver-Fay et al., 2011;Xu & Zhang, 2012) used fragment-assembly approaches, limiting their application to relatively small targets. Limited accuracy also meant that their results often needed sampling across a range of ensembles and rational edits in order to succeed in MR (Rigden et al., 2008;Bibby et al., 2012). However, ab initio modelling methods have advanced with remarkable speed, first by exploiting the residue-contact information available from sequence alignments (see, for example, Marks et al., 2011) and then, dramatically, using bespoke deep neural networks (Senior et al., 2020;. CASP14 saw the stunning performance of Alpha-Fold2 (AF2), which in many cases produced predictions that resembled the target as closely as a different crystal form typically would . The value of predictions from AF2 and the AF2-inspired RoseTTAFold (Baek et al., 2021) as search models was quickly demonstrated, although some cases still required domain splitting or other editing Baek et al., 2021;McCoy et al., 2022;.
As ab initio modelling methods have advanced, so have the corresponding databases of structure predictions. Earlier efforts typically sampled uncharacterized fold space using Pfam domain definitions (Mistry et al., 2021) as a convenient foundation (Ovchinnikov et al., 2017;Lamb et al., 2019;Wang et al., 2019). Although Pfam domain boundaries inferred from sequence alignment alone are not always accurately defined, the entries in these databases could, especially with ensembling, succeed as search models (Simpkin et al., 2019). More recently, AF2 has been used to model complete sequences of 21 whole proteomes, including the human proteome , and the results have been made available in the EBI AlphaFold database (AFDB; https:// alphafold.ebi.ac.uk/). The often high accuracy of the predictions (and they are accompanied by high-quality residue-level error estimates) makes the database a very significant new source of search models for MR.
Here, we present MrParse, which addresses a number of issues in MR. It will find and rank search models from both the PDB and the AFDB, providing convenient visualization of the results. It also guides choices in unconventional MR through secondary-structure prediction and predictions of regions that are relevant to MR strategy such as coiled coils (Thomas et al., , 2020Caballero et al., 2018) and transmembrane helices. When MrParse is provided with diffraction data information it can flag the crystal pathologies that can hinder successful MR (Sevvana et al., 2019;Caballero et al., 2021) and rank experimental homologues from the PDB according to eLLG (Oeffner et al., 2018), which is a good predictor of their suitability as search models.

Reflection data classification
If a reflection file is provided, MrParse creates a table providing information from the reflection file (resolution and space group) and information about the crystal pathology calculated with CTRUNCATE (Evans, 2011) (noncrystallographic symmetry, twinning and anisotropy).

PDB search
MrParse uses phmmer (Eddy, 2011) to search either the full PDB or a 95% sequence identity redundancy-reduced version of it, as provided by MrBUMP . Phmmer also provides information about the regions in the target protein that the hits correspond to. This is used to create a visualization of the search results using Pfam Domain Graphics (Mistry et al., 2021), which allows easy interpretation of how much of the target the search model covers. If a reflection file is provided, Phaser (Oeffner et al., 2018) is used to calculate the eLLG for each of the hits identified by phmmer. It has been shown that eLLG is a better indicator of whether a search model will succeed in MR than sequence identity (Oeffner et al., 2018). Therefore, when a reflection file is provided the search results are ranked by eLLG. Any hits are downloaded from the PDB and trimmed according to their match to the target sequence.

Protein classification
MrParse performs protein classification analysis on the input sequence to predict secondary structure, transmembrane regions and coiled-coil regions. Secondary structure is predicted using the JPred4 (Drozdetskiy et al., 2015) RESTful Application Programming Interface (API), transmembrane regions are predicted by TMHMM (Krogh et al., 2001) and coiled-coil regions are predicted by DeepCoil (Ludwiczak et al., 2019). Currently, coiled-coil and transmembrane predictions require local installations of TMHMM and DeepCoil.

EBI AlphaFold database search
MrParse uses phmmer to search the sequence database provided by the EBI AlphaFold database (https:// alphafold.ebi.ac.uk). As in the PDB search, information from phmmer is used to create a visualization of the search results using Pfam Domain Graphics. For the EBI AlphaFold database, these visualizations are coloured by Predicted Local Distance Difference Test (pLDDT) on an orange to blue scale, where orange indicates very low confidence in the model and blue indicates very high confidence in the model. Additional information is provided about the quality of the AF2 models, including the average pLDDT and a new measure of structural quality called the H-score.
The H-score can be calculated with the following equation, where N represents a list of pLDDT scores and jNj represents the number of elements in N, H-score ¼ maxfa n ; n ¼ 1; 2; 3; . . . ; 100 with a n ig: Any hits are downloaded from the database and trimmed according to their match to the target sequence, and the pLDDT scores are converted into estimated B factors using an algorithm developed for phaser.voyager (Claudia Millán; https://gitlab.developers.cam.ac.uk/scm/haematology/readgroup/ phaser_voyager/-/blob/master/src/Voyager/MDSLibraries/pdb_ structure.py). Interpreting pLDDT as B factors improves the likelihood of success in MR by downweighting the less reliable regions of the model (Croll et al., 2019). At the time of writing, calculation of eLLGs for AFDB entries is not possible since their coordinate error with respect to the unknown target cannot be reliably estimated: it will have two elements, intrinsic modelling error and the error resulting from the target and search model, likely with a relationship defined by a degree of sequence (and hence structural) divergence.

Examples
3.1. Interpreting the MrParse report page Fig. 1 shows an example of a MrParse report page generated from the reflection data and sequence data for PDB entry 5lm4. Here, we use PDB entry 5lm4 to demonstrate how to interpret the results of an MrParse run.
3.1.1. HKL info. The 'HKL info' panel ( Fig. 1, red) allows us to assess whether there are any crystal pathologies that might make MR more difficult. For example, the detection of translational noncrystallographic symmetry can be important for successful MR (Caballero et al., 2021). In the case of PDB entry 5lm4, we have a 2.69 Å resolution data set which shows anisotropy. Phaser can be used to correct anisotropic data and performs this step automatically in its autoMR pipeline .
3.1.2. Experimental structures from the PDB. The 'Experimental structures from the PDB' panel ( Fig. 1, teal) provides information about homologues identified by phmmer. In this example, we can see that we have identified three near-full-length matches when looking at the visualization of regions on the right-hand side (PDB entries 6s3q, 6mp6 and 6rvx). These hits all have high sequence identity to our target (65%, 66% and 64%, respectively) and give high eLLG scores (1135.1, 1092.3 and 1014, respectively). When eLLG is much greater than 60, structure solution by MR is likely to be straightforward (Oeffner et al., 2018); therefore, we can be fairly confident that these search models will work in MR. Further down the list of hits it can be seen that the target seems to match experimental structures in two distinct regions, which are likely to correspond to structural domains. Any matches are downloaded from the PDB and trimmed to match the target sequence. These are downloaded into the homologues subdirectory in the MrParse run directory.
3.1.3. Sequence-based predictions. The 'Sequence based predictions' panel ( Fig. 1, purple) provides secondary-structure, transmembrane and coiled-coil predictions. In this example, JPred4 predicts a large number of helices and TMHMM predicts several transmembrane regions. For a highresolution data set that is predicted to be predominantly helical, an approach such as AMPLE helical ensembles (Sá nchez Rodríguez et al., 2020) or ARCIMBOLDO (Rodríguez et al., 2012) can be used. If coiled coils were predicted, AMPLE and ARCIMBOLDO also have coiled-coil specific modes that can be tried (Thomas et al., 2020;Caballero et al., 2018).
3.1.4. Structure predictions from the EBI AlphaFold database. The 'Structure predictions from the EBI AlphaFold database' panel ( Fig. 1, blue) provides information about AF2 models identified by phmmer in the AFDB. In this example, we can see a large number of AF2 hits. These hits are largely very high quality, with an average pLDDT score of >80 for all of the hits. The visualization on the right-hand side shows the regions that the models correspond to and provides information about predicted model reliability at a residue level. For example, the few models that match the C-terminal region of the target structure (P24942, P43003 and D7RVS0) all have lower predicted reliability in this region. Any matches are downloaded from the AFDB and trimmed to match the target sequence and undergo a pLDDT to estimated B-factor conversion to improve their performance in MR. These are downloaded into the models subdirectory in the MrParse run directory.

Use of an AFDB entry for MR when a PDB search model is lacking
PDB entry 7dry is a crystal structure of Aspergillus oryzae Rib2 deaminase experimentally determined by Zn-SAD . A phmmer search of the PDB only identified a single hit (PDB entry 2cvi) that only covers a 71residue region of the target protein with 31% sequence identity (Figs. 2a and 2b). This homologue was insufficiently similar to the target protein to succeed in MR. A search of the EBI AlphaFold2 database identified a number of models that covered a larger region of the target protein and with a higher sequence identity. MR with the model of Q12362, the best hit ranked by H-score (Figs. 2a and 2c), was successfully placed by Phaser (LLG =173, TFZ = 15.4) and rebuilt with Buccaneer (Cowtan, 2006; R factor = 0.23, R free = 0.25).

Discussion
A crystallographer attempting to solve a macromolecular crystal structure by MR should be aware of the existence of any crystal pathologies and has an increasing range of searchmodel options to choose from. MrParse is designed to bring together a range of relevant information in a single place and present it with useful visualizations and sortable tables. For most effective use, it expects both diffraction data and a target sequence, but it can run without the former. Conventional MR using homologous structures identified in the PDB is supported by the presentation of potential search models, Figure 1 Highlighted sections of an MrParse report page. In red is information on the input reflection file, including resolution, space group and crystal pathology. In teal is information about the PDB entries identified by phmmer and visualizations of the matches. In purple is the protein classification report; this includes a secondary-structure prediction, a coiled-coil prediction and a transmembrane prediction. Finally, in blue is information about the AlphaFold models identified by phmmer and visualizations of the matches coloured by pLDDT on an orange to blue scale, where orange indicates very low confidence in the model and blue indicates very high confidence in the model. discovered by phmmer, with graphics that illustrate their extent relative to the target and numerical data that illustrate their size and characteristics. In the future, more sensitive HHpred (Sö ding, 2005) sequence searching will be supported. With diffraction data supplied, search models are ordered by default by eLLG as a predictor of their relative utility in MR. At present, PDB files are available locally and through the CCP4i2 GUI (Potterton et al., 2018) and online through the CCP4 Cloud setting . In the future, options for inline composition of ensembles will be implemented. The PDB files, which are trimmed according to their match to the target sequence and modified to convert the predicted residue error into a B factor (Claudia Millá n; https://gitlab.developers.cam.ac.uk/scm/haematology/readgroup/ phaser_voyager/-/blob/master/src/Voyager/MDSLibraries/pdb_ structure.py), can be fed directly to programs such as Phaser  or MOLREP (Vagin & Teplyakov, 2010) or may, in more difficult cases, require special treatment  The closest match in the PDB (PDB entry 2civ, blue) aligned with the crystal structure (PDB entry 7dry, grey). (c) The closest match in the EBI AlphaFold database (Q12362, coral) aligned with the crystal structure (PDB entry 7dry, grey). (Vagin & Teplyakov, 2010;Rigden et al., 2018;Simpkin et al., 2019;Sammito et al., 2014). The also well established use of secondary-structure elements as search models (Rodríguez et al., 2012), especially at higher resolution, is also facilitated by secondary-structure prediction that enables, for example, helpful predictions of likely helix lengths (Rodríguez et al., 2012).
Perhaps the most exciting and forward-facing aspect of MrParse is its discovery of structure predictions, especially those generated by ab initio (also known as de novo or template-independent) methods. The potential of these methods for MR of targets with novel or divergent folds has been recognized for some time (Rigden et al., 2008;Bibby et al., 2012;Qian et al., 2007). Nevertheless, their (until recently) considerable CPU demands and specialist software have undoubtedly proved offputting to structural biologists, despite the convenience offered by some servers . In addition, the accuracy of ab initio methods has historically not always been sufficient for MR and only smaller proteins were tractable using the earliest methods. This picture has changed rapidly in recent years with first AlphaFold (Senior et al., 2020) and then AlphaFold2 , each providing a step-change in model accuracy. These developments have been mirrored in online databases of ab initio structure predictions. Databases based on earlier methods such as GREMLIN (Ovchinnikov et al., 2017), PconsFam (Lamb et al., 2019) and C-QUARK (Wang et al., 2019) typically modelled single representatives of Pfam families. These provided useful sampling of uncharacterized protein fold space, sometimes being suitable for MR (Simpkin et al., 2019), but were limited by the fact that the domain boundaries of Pfam entries are not always, in the absence of some kind of structural information, accurately determined from sequence analysis (Bateman et al., 2010). The AFDB, in contrast, includes full-length models from 21 essentially complete proteomes, with the ambition to cover UniRef90 (Suzek et al., 2015), so that no protein of interest will be less than 90% identical to an entry in the database, by the end of 2021. Models in the AFDB are likely to be much more accurate than models available elsewhere, and are accompanied by accurate residue-level error estimates. Their availability therefore has profound implications for the choice of crystallographic phasing method (Kryshtafovych et al., 2021;McCoy et al., 2022) and the already very high market share of MR will only increase further.
MrParse currently provides a second graphical panel devoted solely to matches in the AFDB. These can be ranked by clicking on column headings for two measures of model quality: the novel H-score described here or the percentage sequence identity between the protein of interest and the model. While the experience of the CASP structure-prediction experiment suggests that many models serve, unaltered, as successful search models, downstream editing of models after retrieval via MrParse will sometimes be necessary (McCoy et al., 2022;Millá n et al., 2021;. This can eliminate regions with low predicted accuracy (McCoy et al., 2022) or sample a variety of truncated versions , or excise domains from multi-domain models, recognizing that inter-domain packing remains a challenge for AF. Future work will undoubtedly address the automatic identification or ranking of AFDB-derived search models, for example recognizing that small but very accurate substructures may be suitable search models where high-resolution diffraction data are available (McCoy et al., 2017). Furthermore, a systematic exploration of the characteristics of AFDB entries and their ability to predict coordinate error with respect to a given target, as performed with PDB entries (Hatti et al., 2020), will also be highly valuable.
Presently, hits are found by a phmmer (Eddy, 2011) search against a local database containing the sequences of entries in the AFDB. With the ambitious plans to expand the AFDB, this arrangement becomes increasingly awkward as everlarger databases would need to be distributed with CCP4. Happily, the 3D-Beacons initiative (Orengo et al., 2020) will shortly be launching an API for sequence-based retrieval of models not only from the AFDB but also from a variety of other resources containing protein structure predictions. Thus, we envisage that the importance of MrParse in facilitating access to a wide range of potential MR search models, both experimental structures and predictions, will only grow in the future. In addition, its ability to search AFDB in particular and conveniently visualize the results is likely to prove useful to bioinformaticians and cryo-EM researchers (Kryshtafovych et al., 2021;Simpkin et al., 2021) as well as to crystallographers.